L10: Merging Data

Bogdan G. Popescu

John Cabot University

Aggregating by group

In some instances, we might be interested in performing operations by group.

For example, we might be interested in counting the number of observations by group.

Let’s say that we want to know how many beers there are per beer style.

The following will create a group object.

Aggregating by group

Let us have a look at a mock dataset.

Python
import pandas as pd
beer = pd.read_csv('beer_reviews.csv')
beer.columns = ["name_beer", "id_brewery", "name_brewery", "overall_review",
"aroma_review", "appearance_review", "beer_style", "review_taste", "beerid"]
beer.head()
   name_beer               id_brewery  ...            review_taste  beerid
0      10325          Vecchio Birraio  ...            Sausa Weizen   47986
1      10325          Vecchio Birraio  ...                Red Moon   48213
2      10325          Vecchio Birraio  ...  Black Horse Black Beer   48215
3      10325          Vecchio Birraio  ...              Sausa Pils   47969
4       1075  Caldera Brewing Company  ...           Cauldron DIPA   64883

[5 rows x 9 columns]

Aggregating by group

Python
grouped_by_style = beer.groupby(by='beer_style')
grouped_by_style
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x17c9d8920>

If we actually want to see the number of beers by style, we would do the following

Python
grouped_by_style.size()
beer_style
1.0      58
1.5     102
2.0     253
2.5     258
3.0     507
3.5    1137
4.0    1999
4.5    1390
5.0     294
dtype: int64

Aggregating by group

We could also rearrange the number of beers by style in the following way:

Python
#Creating a dataframe
df = pd.DataFrame(grouped_by_style.size())
#Renaming the column
df = df.rename(columns={0: 'count'})
#Sorting
df_sorted = df.sort_values(by='count', ascending=False)
df_sorted
            count
beer_style       
4.0          1999
4.5          1390
3.5          1137
3.0           507
5.0           294
2.5           258
2.0           253
1.5           102
1.0            58

Aggregating by group

If we are interested in the average review score for every beer style, we would do:

Python
a = beer.groupby('beer_style')['overall_review'].mean()
print(a)
beer_style
1.0    1.612069
1.5    1.887255
2.0    2.318182
2.5    2.660853
3.0    3.164694
3.5    3.585752
4.0    3.940220
4.5    4.210072
5.0    4.450680
Name: overall_review, dtype: float64

Aggregating by group

We could also calculate the average review for more than one category

Python
a = beer.groupby(['beer_style', 'name_brewery'])['overall_review'].mean()
print(a)
beer_style  name_brewery
1.0         1.0             1.733333
            1.5             1.529412
            2.0             1.437500
            2.5             1.250000
            3.0             1.500000
                              ...   
5.0         3.0             4.500000
            3.5             4.600000
            4.0             4.306122
            4.5             4.400826
            5.0             4.555085
Name: overall_review, Length: 64, dtype: float64

Aggregating by group

Another option to work with grouped data is to use the agg function.

This allows us to apply multiple functions to one group.

Python
a = beer.groupby(['beer_style', 'name_brewery'])['overall_review'].agg(["mean", "count"])
print(a)
                             mean  count
beer_style name_brewery                 
1.0        1.0           1.733333     30
           1.5           1.529412     17
           2.0           1.437500      8
           2.5           1.250000      2
           3.0           1.500000      1
...                           ...    ...
5.0        3.0           4.500000      1
           3.5           4.600000      5
           4.0           4.306122     49
           4.5           4.400826    121
           5.0           4.555085    118

[64 rows x 2 columns]

Exercises

  1. What are the top 5 beer styles?
  2. What are the top 5 worst breweries?
  3. Which brewery has the greatest variability in its rating?

Exercises

  1. What are the top 5 beer styles?
Python
#Calculate the overall review by beer style
a = beer.groupby(['beer_style'])['overall_review'].agg(["mean"])
#Creating a dataframe
df = pd.DataFrame(a)
#Renaming the column
df = df.rename(columns={"mean": 'avg_review'})
#Sorting
df_sorted = df.sort_values(by='avg_review', ascending=False)
df_sorted.head(5)
            avg_review
beer_style            
5.0           4.450680
4.5           4.210072
4.0           3.940220
3.5           3.585752
3.0           3.164694

Exercises

  1. What are the top 5 worst breweries?
Python
#Calculate the overall review by beer style
b = beer.groupby(['name_brewery'])['overall_review'].agg(["mean"])
#Creating a dataframe
df = pd.DataFrame(b)
#Renaming the column
df = df.rename(columns={"mean": 'avg_review'})
#Sorting
df_sorted = df.sort_values(by='avg_review', ascending=True)
df_sorted.head(5)
              avg_review
name_brewery            
1.5             1.926136
1.0             1.956522
2.0             2.416667
2.5             2.702479
3.0             3.175424

Exercises

  1. Which brewery has the greatest variability in its rating?
Python
#Calculate the overall review by beer style
c = beer.groupby(['name_brewery'])['overall_review'].std()
#Identify the highest variability
max_variability_brewery = c.sort_values(ascending=False).head(1)
max_variability_brewery
name_brewery
1.0    0.822098
Name: overall_review, dtype: float64

Merging Data Introduction

We can use different functions to combine different dataframes including: concat, append, merge, and join.

Python
import pandas as pd
import numpy as np

The following graph shows all the different types of joins available:

Inner Join

In this case, we have two dataframes: x and y.

For an inner join, only the shaded area will be returned.

Outer Join

In this case, we have two dataframes: x and y.

This takes all the data from x and y

Left Join

In this case, we have two dataframes: x and y.

This takes everything from the left as a point of reference and matches everything onto the left dataframe

Right Join

In this case, we have two dataframes: x and y.

This takes everything from the right as a point of reference and matches everything onto the right dataframe

Examples

Let us say that we have the following dataframes:

Python
import pandas as pd
df_a = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Python
df_b = pd.DataFrame({
    'id': [2, 3, 4],
    'city': ['New York', 'Los Angeles', 'Chicago'],
    'salary': [70000, 80000, 60000]
})
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000

Inner Join Example


Python
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Python
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
Python
# Inner join on 'id'
inner_join = pd.merge(df_a, df_b, on='id', how='inner')
inner_join
   id     name  age         city  salary
0   2      Bob   30     New York   70000
1   3  Charlie   35  Los Angeles   80000

Outer Join Example


Python
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Python
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
Python
# Inner join on 'id'
outer_join = pd.merge(df_a, df_b, on='id', how='outer')
outer_join
   id     name   age         city   salary
0   1    Alice  25.0          NaN      NaN
1   2      Bob  30.0     New York  70000.0
2   3  Charlie  35.0  Los Angeles  80000.0
3   4      NaN   NaN      Chicago  60000.0

Left Join Example


Python
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Python
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
Python
# Inner join on 'id'
left_join = pd.merge(df_a, df_b, on='id', how='left')
left_join
   id     name  age         city   salary
0   1    Alice   25          NaN      NaN
1   2      Bob   30     New York  70000.0
2   3  Charlie   35  Los Angeles  80000.0

Right Join Example


Python
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Python
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
Python
# Inner join on 'id'
right_join = pd.merge(df_a, df_b, on='id', how='right')
right_join
   id     name   age         city  salary
0   2      Bob  30.0     New York   70000
1   3  Charlie  35.0  Los Angeles   80000
2   4      NaN   NaN      Chicago   60000

Other Methods

Other methods of merging two dataframes include join and concat.

To keep things simple, we will not do those.

Exercises

Let us imagine the following datasets:

Python
raw_data_1 = { 'subject_id': ['1', '2', '3', '4', '5'], 
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = { 'subject_id': ['4', '5', '6', '7', '8'], 
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = { 'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'], 
'test_grade': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

Exercises

  1. Create a dataframe for each data
  2. Join the first 2 DFs so that it generates a list with all the users.
  3. Join the DF from the previous section with the data from the grades (raw_data_3) so that each user is assigned their grade.
  4. Join the data from raw_data_1 and raw_data_2 only from the records with the same subject_id

Exercises Solution 1

  1. Create a dataframe for each data
Python
df1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
df2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
df3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_grade'])

Exercises Solution 2

  1. Join the first 2 DFs so that it generates a list with all the users.
Python
# Merge the DataFrames
merged_df = pd.merge(df1, df2, on=['subject_id', 'first_name', 'last_name'], how='outer')
merged_df
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          4      Billy    Bonder
5          5     Ayoung   Atiches
6          5      Brian     Black
7          6       Bran   Balwner
8          7      Bryce     Brice
9          8      Betty    Btisan
Python
df1
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
Python
df2
  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan

Exercises Solution 3

  1. Join the DF from the previous section with the data from the grades (raw_data_3) so that each user is assigned their grade.
Python
left_join = pd.merge(merged_df, df3, on='subject_id', how='left')
left_join
  subject_id first_name last_name  test_grade
0          1       Alex  Anderson        51.0
1          2        Amy  Ackerman        15.0
2          3      Allen       Ali        15.0
3          4      Alice      Aoni        61.0
4          4      Billy    Bonder        61.0
5          5     Ayoung   Atiches        16.0
6          5      Brian     Black        16.0
7          6       Bran   Balwner         NaN
8          7      Bryce     Brice        14.0
9          8      Betty    Btisan        15.0

Exercises Solution 4

  1. Join the data from raw_data_1 and raw_data_2 only from the records with the same subject_id
Python
merged_df = pd.merge(df1, df2, on='subject_id', how='inner')
merged_df
  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black

Conclusion

Aggregation: Grouping and aggregating data is vital for summarizing patterns, including counting, averaging, and applying multiple functions across groups.

Join Types: Pandas supports diverse join operations (inner, outer, left, right) to combine datasets based on shared keys.

Data Merging: Functions like merge, join, and concat enable seamless integration of datasets, whether aligning by index or column.

Hands-On Practice: Exercises illustrate real-world scenarios for combining datasets and performing operations like filtering and summarizing merged data.